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Abstract 

We had previously shown that regularization principles lead to approximation schemes which are equivalent 
to networks with one layer of hidden units, called Regularization Networks. In particular we had discussed 
how standard smoothness functionals lead to a subclass of regularization networks, the well-known Radial 
Basis Functions approximation schemes. In this paper we show that regularization networks encompass 
a much broader range of approximation schemes, including many of the popular general additive models 
and some of the neural networks. In particular we introduce new classes of smoothness functionals that 
lead to different classes of basis functions. Additive splines as well as some tensor product splines can 
be obtained from appropriate classes of smoothness functionals. Furthermore, the same extension that 
leads from Radial Basis Functions (RBF) to Hyper Basis Functions (HBF) also leads from additive models 
to ridge approximation models, containing as special cases Breiman's hinge functions and some forms of 
Projection Pursuit Regression. We propose to use the term Generalized Regularization Networks for this 
broad class of approximation schemes that follow from an extension of regularization. In the probabilistic 
interpretation of regularization, the different classes of basis functions correspond to different classes of 
prior probabilities on the approximating function spaces, and therefore to different types of smoothness 
assumptions. In the final part of the paper, we show the relation between activation functions of the 
Gaussian and sigmoidal type by considering the simple case of the kernel G(x) = \x\. 

In summary, different multilayer networks with one hidden layer, which we collectively call Generalized 
Regularization Networks, correspond to different classes of priors and associated smoothness functionals 
in a classical regularization principle. Three broad classes are a) Radial Basis Functions that generalize 
into Hyper Basis Functions, b) some tensor product splines, and c) additive splines that generalize into 
schemes of the type of ridge approximation, hinge functions and one-hidden-layer perceptrons. 
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1 Introduction 

In recent papers we and others have argued that the 
task of learning from examples can be considered in 
many cases to be equivalent to multivariate function ap- 
proximation, that is, to the problem of approximating a 
smooth function from sparse data, the examples. The 
interpretation of an approximation scheme in terms of 
networks, and viceversa, has also been extensively dis- 
cussed (Barron and Barron, 1988; Poggio and Girosi, 
1989, 1990; Broomhead and Lowe, 1988). 

In a series of papers we have explored a specific, al- 
beit quite general, approach to the problem of function 
approximation. The approach is based on the recogni- 
tion that the ill-posed problem of function approxima- 
tion from sparse data must be constrained by assum- 
ing an appropriate prior on the class of approximating 
functions. Regularization techniques typically impose 
smoothness constraints on the approximating set of func- 
tions. It can be argued that some form of smoothness 
is necessary to allow meaningful generalization in ap- 
proximation type problems (Poggio and Girosi, 1989, 
1990). A similar argument can also be used in the case 
of classification where smoothness involves the classifica- 
tion boundaries rather than the input-output mapping 
itself. Our use of regularization, which follows the clas- 
sical technique introduced by Tikhonov (1963, 1977), 
identifies the approximating function as the minimizer 
of a cost functional that includes an error term and a 
smoothness functional, usually called a stabilizer. In the 
Bayesian interpretation of regularization the stabilizer 
corresponds to a smoothness prior, and the error term 
to a model of the noise in the data (usually Gaussian 
and additive). 

In Poggio and Girosi (1989) we showed that regular- 
ization principles lead to approximation schemes which 
are equivalent to networks with one "hidden" layer, 
which we call Regularization Networks (RN). In par- 
ticular, we described how a certain class of radial sta- 
bilizers - and the associated priors in the equivalent 
Bayesian formulation - lead to a subclass of regular- 
ization networks, the already-known Radial Basis Func- 
tions (Powell, 1987, 1990; Micchelli, 1986; Dyn, 1987) 
that we have extended to Hyper Basis Functions (Poggio 
and Girosi, 1990, 1990a). The regularization networks 
with radial stabilizers we studied include all the classi- 
cal one-dimensional as well as multidimensional splines 
and approximation techniques, such as radial and non- 
radial Gaussian or multiquadric functions. In Poggio and 
Girosi (1990, 1990a) we have extended this class of net- 
works to Hyper Basis Functions (HBF). In this paper we 
show that an extension of Regularization Networks, that 
we propose to call Generalized Regularization Networks 
(GRN), encompasses an even broader range of approx- 
imation schemes, including, in addition to HBF, tensor 
product splines, many of the general additive models, 
and some of the neural networks. 

The plan of the paper is as follows. We first discuss 
the solution of the variational problems of regularization 
in a rather general form. We then introduce three differ- 
ent classes of stabilizers - and the corresponding priors 
in the equivalent Bayesian interpretation - that lead to 



different classes of basis functions: the well-know radial 
stabilizers, tensor-product stabilizers, and the new addi- 
tive stabilizers that underlie additive splines of different 
types. It is then possible to show that the same exten- 
sion that leads from Radial Basis Functions to Hyper 
Basis Functions leads from additive models to ridge ap- 
proximation, containing as special cases Breiman's hinge 
functions (1992) and ridge approximations of the type 
of Projection Pursuit Regression (PPR) (Friedman and 
Stuezle, 1981; Huber, 1985). Simple numerical exper- 
iments are then described to illustrate the theoretical 
arguments. 

In summary, the chain of our arguments shows that 
ridge approximation schemes such as 
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where 



Mi/) = I>sg(j/-<s) 

a = l 

are approximations of Regularization Networks with ap- 
propriate additive stabilizers. The form of G depends 
on the stabilizer, and includes in particular cubic splines 
(used in typical implementations of PPR) and one- 
dimensional Gaussians. It seems, however, impossible 
to directly derive from regularization principles the sig- 
moidal activation functions used in Multilayer Percep- 
trons. We discuss in a simple example the close relation- 
ship between basis functions of the hinge, the sigmoid 
and the Gaussian type. 

The appendices deal with observations related to the 
main results of the paper and more technical details. 

2 The regularization approach to the 
approximation problem 

Suppose that the set g = {(x;, j/ 8 ) 6 R d x R}(L 1 of data 
has been obtained by random sampling of a function /, 
belonging to some space of functions X defined on R d , 
in the presence of noise, and suppose we are interested 
in recovering the function /, or an estimate of it, from 
the set of data g. This problem is clearly ill-posed, since 
it has an infinite number of solutions. In order to choose 
one particular solution we need to have some a priori 
knowedge of the function that has to be reconstructed. 
The most common form of a priori knowledge consists in 
assuming that the function is smooth, in the sense that 
two similar inputs correspond to two similar outputs. 
The main idea underlying regularization theory is that 
the solution of an ill-posed problem can be obtained from 
a variational principle, which contains both the data and 
prior smoothness information. Smoothness is taken into 
account by defining a smoothness functional <f>[f] in such 
a way that lower values of the functional correspond to 
smoother functions. Since we look for a function that 
is simultaneously close to the data and also smooth, it 
is natural to choose as a solution of the approximation 
problem the function that minimizes the following func- 
tional: 
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H[f] = J2(f(xi)-yi) 2 + ^[f] 



(i) 



where A is a positive number that is usually called the 
regularization parameter. The first term is enforcing 
closeness to the data, and the second smoothness, while 
the regularization parameter controls the tradeoff be- 
tween these two terms. 

It can be shown that, for a wide class of functionals <f>, 
the solutions of the minimization of the functional (1) all 
have the same form. Although a detailed and rigorous 
derivation of the solution of this problem is out of the 
scope of this memo, a simple derivation of this general 
result is presented in appendix (A). In this section we 
just present a family of smoothness functionals and the 
corresponding solutions of the variational problem. We 
refer the reader to the current literature for the mathe- 
matical details (Wahba, 1990; Madych and Nelson, 1990; 
Dyn, 1987). 

We first need to give a more precise definition of 
what we mean by smoothness and define a class of suit- 
able smoothness functionals. We refer to smoothness as 
a measure of the "oscillatory" behavior of a function. 
Therefore, within a class of differentiable functions, one 
function will be said to be smoother than another one if 
it oscillates less. If we look at the functions in the fre- 
quency domain, we may say that a function is smoother 
than another one if it has less energy at high frequency 
(smaller bandwidth). The high frequency content of a 
function can be measured by first high-pass filtering the 
function, and then measuring the power, that is the L^ 
norm, of the result. In formulas, this suggests defining 
smoothness functionals of the form: 
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l/(s)| 2 
G(s) 



(2) 



where ~ indicates the Fourier transform, G is some posi- 
tive function that falls off to zero as ||s|| — ► oo (so that ~ 
is an high-pass filter) and for which the class of functions 
such that this expression is well defined is not empty. For 
a well defined class of functions G (Madych and Nelson, 
1990; Dyn, 1991) this functional is a semi-norm, with a 
finite dimensional null space Af. The next section will 
be devoted to giving examples of the possible choices for 
the stabilizer <f>. For the moment we just assume that it 
can be written as in eq. (2), and make the additional as- 
sumption that G is symmetric, so that its Fourier trans- 
form G is real and symmetric. In this case it is possible 
to show (see appendix (A) for a sketch of the proof) 
that the function that minimizes the functional (1) has 
the form: 
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/(x) = ^2 c»G(x - x,) + ^2 d a ip a (x) 



(3) 



where {ip a } k a=l is a basis in the k dimensional null space 
Af and the coefficients d a and c; satisfy the following 
linear system: 



(G + A7)c + * T d = y 

*c = 

where I is the identity matrix, and we have defined 



(y).- 


= Vi , 


(c), = c t , (d), = di , 


(G),-; 


= G(x 8 - 


- Xj) , (*)„; = -lp a (xi) 



The existence of a solution to the linear system shown 
above is guaranteed by the existence of the solution of 
the variational problem. The case of A = corresponds 
to pure interpolation, and in this case the solvability of 
the linear system depends on the properties of the basis 
function G. 

The approximation scheme of eq. form (3) has a sim- 
ple interpretation in terms of a network with one layer 
of hidden units, which we call a Regularization Network 
(RN). Appendix B describes the simple extension to vec- 
tor output scheme. 

3 Classes of stabilizers 

In the previous section we considered the class of stabi- 
lizers of the form: 
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i/oo 



(4) 



Ir* G(s) 

and we have seen that the solution of the minimization 
problem always has the same form. In this section we 
discuss three different types of stabilizers belonging to 
the class (4), corresponding to different properties of the 
basis functions G. Each of them corresponds to differ- 
ent a priori assumptions about the smoothness of the 
function that must be approximated. 

3.1 Radial stabilizers 

Most of the commonly used stabilizers have radial sim- 
metry, that is, they satisfy the following equation: 

«/>[/(x)] = <f>[f(Rx)] 

for any rotation matrix R. This choice reflects the a 
priori assumption that all the variables have the same 
relevance, and that there are no priviliged directions. 
Rotation invariant stabilizers correspond clearly to ra- 
dial basis function G(||x||). Much attention has been 
dedicated to this case, and the corresponding approx- 
imation technique is known as Radial Basis Functions 
(Micchelli, 1986; Powell, 1987). The class of admissible 
Radial Basis Functions is the class of conditionally pos- 
itive definite functions of any order, since it has been 
shown (Madych and Nelson, 1991; Dyn, 1991) that in 
this case the functional of eq. (4) is a semi-norm, and 
the associated variational problem is well defined. All 
the Radial Basis Functions can therefore be derived in 
this framework. We explicitly give two important exam- 
ples. 

Duchon multivariate splines 



Duchon (1977) considered measures of smoothness of the 
form 



</>[/]= / ds ||s|H/(s)| 2 . 
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In this case G(s) = ,, ,| 2m and the corresponding basis 
function is therefore 



G(x) 



\2m-d 
\2m — d 



In ||x|| if 2m > d and d is even 
otherwise. 



(5) 

In this case the null space of cf)[f] is the vector space 
of polynomials of degree at most m in d variables, whose 
dimension is 



l + m-1 



These basis functions are radial and conditionally pos- 
itive definite, so that they represent just particular in- 
stances of the well known Radial Basis Functions tech- 
nique (Micchelli, 1986; Wahba, 1990). In two dimen- 
sions, for m = 2, eq. (5) yields the so called "thin 
plate" basis function G(x) = ||x|| 2 In ||x|| (Harder and 
Desmarais, 1972), depicted in figure (1). 

The Gaussian 

A stabilizer of the form 
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, HSU' 
is e f> 



l/(s)| 2 



||S|| 2 

where /3 is a fixed positive parameter, has G(s) = e f 
and as basis function the Gaussian function, represented 
in figure (2). The Gaussian function is positive definite, 
and it is well known from the theory of reproducing ker- 
nels that positive definite functions can be used to de- 
fine norms of the type (4). Since <f>[f] is a norm, its null 
space contains only the zero element, and the additional 
null space terms of eq. (3) are not needed, unlike in 
Duchon splines. A disadvantage of the Gaussian is the 
appearance of the scaling parameter /?, while Duchon 
splines, being homogeneous functions, do not depend on 
any scaling parameter. However, it is possible to devise 
good heuristics that furnish sub-optimal, but still good, 
values of /?, or good starting points for cross-validation 
procedures. 

Other Basis Functions 

Here we give a list of other functions that can be used as 
basis functions in the Radial Basis Functions technique, 
and that are therefore associated with the minimization 
of some functional. In the following table we indicate as 
"p.d." the positive definite functions, which do not need 
any polynomial term in the solution, and as "c.p.d. k" 
the conditionally positive definite functions of order k, 
which need a polynomial of degree k in the solution. 



G(r) = e l3r Gaussian, p.d. 

G(r) = \/r 2 + c 2 multiquadric, c.p.d. 1 



G(r) 



inverse multiquadric, p.d. 



Vc 2 +r 2 

G(r) = r 2n+1 multivariate splines, c.p.d. n 

G(r) = r 2n lnr multivariate splines, c.p.d. n 

3.2 Tensor product stabilizers 

An alternative to choosing a radial function G in the 
stabilizer (4) is a tensor product type of basis function, 
that is a function of the form 



G(s) = n? =1 flf( s> ) 



(6) 



where Sj is the j-th coordinate of the vector s, and g 
is an appropriate one-dimensional function. When g is 
positive definite the functional <f>[f] is clearly a norm and 
its null space is empty. In the case of a conditionally 
positive definite function the structure of the null space 
can be more complicated and we do not consider it here. 
Stabilizers with G(s) as in equation (6) have the form 
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which leads to a tensor product basis function 

G(x) = 11j =1 g(x j ) 

where Xj is the j-th coordinate of the vector x and g(x) 
is the Fourier transform of g(s). An interesting example 
is the one corresponding to the choice: 



»(*) 



1 



1 + s 2 
which leads to the basis function: 



G(x) = m 



-\x,\ 



■ELm 



This basis function is interesting from the point of view 
of VLSI implementations, because it requires the com- 
putation of the L\ norm of the input vector x, which 
is usually easier to compute than the Euclidean norm 
L'2- However, this basis function in not very smooth, 
as shown in figure (3), and its performance in practical 
cases should first be tested experimentally. 

We notice that the choice 



leads again to the Gaussian basis function G(x) 

o-llxll 2 



3.3 Additive stabilizers 

We have seen in the previous section how some tensor 
product approximation schemes can be derived in the 
framework of regularization theory. We now will see that 
is also possible to derive the class of additive approxima- 
tion schemes in the same framework, where by additive 
approximation we mean an approximation of the form 



/(x) = £/„(*") (7) 

where x^ is the //-th component of the input vector x 
and the f^ are one-dimensional functions that will be 
defined as the additive components of / (from now on 
Greek letter indices will be used in association with com- 
ponents of the input vectors). Additive models are well 
known in statistics (see Hastie and Tibshirani's book, 
1990) and can be consider as a generalization of linear 
models. They are appealing because, being essentially a 
superposition of one-dimensional functions, they have a 
low complexity, and they share with linear models the 
feature that the effects of the different variables can be 
examined separately. 

The simplest way to obtain such an approximation 
scheme is to choose a stabilizer that corresponds to an 
additive basis function (see fig. 4 for an example): 



G(x) 
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0,g(x") 



(8) 



where 9 ^ are certain fixed parameters. Such a choice, in 
fact, leads to an approximation scheme of the form (7) 
in which the additive components f^ have the form: 
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Ux) = e,J2 c i G ( x "- x i) 



(9) 



Notice that the additive components are not independent 
at this stage, since there is only one set of coefficients c 8 -. 
We postpone the discussion of this point to section (4.2). 
We would like to write stabilizers corresponding to 
the basis function (8) in the form (4), where G(s) is the 
Fourier transform of G(x). We notice that the Fourier 
transform of an additive function like the one in equation 
(8) is a distribution. For example, in two dimensions we 
obtain 



In the limit of e going to zero the denominator in ex- 
pression (11) approaches eq. (10), and the basis func- 
tion (12) approaches a basis function that is the sum of 
one-dimensional basis functions. In this paper we do not 
discuss this limit process in a rigorous way. Instead we 
outline another way to obtain additive approximations 
in the framework of regularization theory. 

Let us assume that we know a priori that the function 
/ that we want to approximate is additive, that is: 



/(x) = £/„(*") 



We then apply the regularization approach and impose a 
smoothness constraint, not on the function / as a whole, 
but on each single additive component, through a regu- 
larization functional of the form: 



N d d 1 . 

H[f] = J2^-J2M x ^ 2 + x J2r ds 
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where 9 ^ are given positive parameters which allow us to 
impose different degrees of smoothness on the different 
additive components. The minimizer of this functional 
is found with the same technique described in appendix 
(A), and skipping null space terms, it has the usual form 



N 



/( x ) = ^2ciG(x-Xi) 



(13) 



where 



G(x-x 8 ) = ]T^(^-<), 

as in eq. (8). 

We notice that the additive component of eq. (13) 
can be written as 
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where we have defined 



G(s) = 9 x g{s x )h{s y ) + 9 y g(s y )6(s x ) (10) 

and the interpretation of the reciprocal of this expression 
is delicate. However, almost additive basis functions can 
be obtained if we approximate the delta functions in eq. 
(10) with Gaussians of very small variance. Consider, 
for example in two dimensions, the stabilizer: 
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(ii) 



)R* 9 x g(s x )e ( ^ }2 +0 y g(s y )e ( "? ) 2 

This corresponds to a basis function of the form: 

G(x, y) = e x g(x)e-^y 2 + y g{y)e-<?** . (12) 



The additive components are therefore not independent 
because the parameters 9^ are fixed. If the 9 ^ were free 
parameters, the coefficients cf would be independent, as 
well as the additive components. 

Notice that the two ways we have outlined for deriv- 
ing additive approximation from regularization theory 
are equivalent. They both start from a prior assumption 
of additivity and smoothness of the class of functions 
to be approximated. In the first technique the two as- 
sumptions are both in the choice of the stabilizer, (eq. 
11); in the second they are made explicit and exploited 
sequentially. 



4 Extensions: from Regularization 
Networks to Generalized 
Regularization Networks 

In this section we will first review some extensions of reg- 
ularization networks, and then will apply them to Radial 
Basis Functions and to additive splines. 

A fundamental problem in almost all practical appli- 
cations in learning and pattern recognition is the choice 
of the relevant variables. It may happen that some of 
the variables are more relevant than others, that some 
variables are just totally irrelevant, or that the relevant 
variables are linear combinations of the original ones. 
It can therefore be useful to work not with the original 
set of variables x, but with a linear transformation of 
them, Wx, where W is a possibily rectangular matrix. 
In the framework of regularization theory, this can be 
taken into account by making the assumption that the 
approximating function / has the form /(x) = F(Wx) 
for some smooth function F. The smoothness assump- 
tion is now made directly on F , through a smoothness 
functional 4>[F] of the form (4). The regularization func- 
tional is now expressed in terms of F as 



H[F] 
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E 
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(m-Fiz^f + X^F] 



where z; = Wx; . The function that minimizes this func- 
tional is clearly, accordingly to the results of section (2), 
of the form: 



N 



F(z) = J2aG(z-z i ) 



(plus eventually a polynomial in z). Therefore the solu- 
tion for / is: 

N 

/(x) = f(Wx) = ^c,G(Wx - Wx,-) (14) 

8 = 1 

This argument is exact for given and known W, as in 
the case of classical Radial Basis Functions. Usually the 
matrix W is unknown, and it must be estimated from 
the examples. Estimating both the coefficients c; and 
the matrix W by least squares is probably not a good 
idea, since we would end up trying to estimate a num- 
ber of parameters that is larger than the number of data 
points (though one may use regularized least squares). 
Therefore, it has been proposed to replace the approxi- 
mation scheme of eq. (14) with a similar one, in which he 
basic shape of the approximation scheme is retained, but 
the number of basis functions is decreased. The result- 
ing approximating function that we call the Generalized 
Regularization Network (GRN) is: 



/(x) 



Z^ 



,G(Wx-Wt a ) 



(15) 



where n < N and the centers t a are chosen according to 
some heuristic (Moody and Darken, 1989), or are consid- 
ered as free parameters (Poggio and Girosi, 1989, 1990). 



The coefficients c a and the elements of the matrix W 
are estimated accordingly to a least squares criterion. 
The elements of the matrix W could also be estimated 
through cross-validation, which may be a formally more 
appropriate technique. 

In the special case in which the matrix W and the 
centers are kept fixed, the resulting technique is one orig- 
inally proposed by Broomhead and Lowe (1988), and the 
coefficients satisfy the following linear equation: 

G T Gc = G T y , 

where we have defined the following vectors and matri- 
ces: 



(y).- 



Vi 



(C)c 



(G) ia = G(x,- - t«) 



This technique, which has become quite common in the 
neural network community, has the advantage of retain- 
ing the form of the regularization solution, while being 
less complex to compute. A complete theoretical analy- 
sis has not yet been given, but some results, in the case 
in which the matrix W is set to identity, are already 
available (Sivakumar and Ward, 1991). 

The next sections discuss approximation schemes of 
the form (15) in the cases of radial and additive basis 
functions. 

4.1 Extensions of Radial Basis Functions 

In the case in which the basis function is radial, the 
approximation scheme of eq. (15) becomes: 
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c„G(||x - t„|| w ) 



(16) 



where we have defined the weighted norm: 

||x|| w =x- W T Wx . 

The basis functions of eq. (15) are not radial anymore, 
or, more accurately, they are radial in the metric defined 
by eq. (16). This means that the level curves of the basis 
functions are not circles, but ellipses, whose axes do not 
need to be aligned with the coordinate axis. Notice that 
in this case what is important is not the matrix W itself, 
but rather the product matrix W T W. Therefore, by the 
Cholesky decomposition, it is sufficient to take W upper 
triangular. The approximation scheme defined by eq. 
(15) has been discussed in detail in (Poggio and Girosi, 
1990; Girosi, 1992), so we do will not discuss it further, 
and will consider, in the next section, its analogue in the 
case of additive basis functions. 

4.2 Extensions of additive splines 

In the previous sections we have seen an extension of 
the classical regularization technique. In this section we 
derive the form that this extension takes when applied 
to additive splines. The resulting scheme is very similar 
to Projection Pursuit Regression (Friedman and Stuezle, 
1981; Huber, 1985). 

We start from the "classical" additive spline, derived 
from regularization in section (3.3): 
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/(x) = S>S>„G(z"-*n 



(17) 



In this scheme the smoothing parameters 6^ should be 
known, or can be estimated by cross-validation. An al- 
ternative to cross-validation is to consider the param- 
eters 6ft as free parameters, and estimate them with a 
least square technique together with the coefficients c;. 
If the parameters 6^ are free, the approximation scheme 
of eq. (17) becomes the following: 

N d 

where the coefficients cf are now independent. Of 
course, now we must estimate N x d coefficients instead 
of just N , and we are likely to encounter the overfitting 
problem. We then adopt the same idea presented in sec- 
tion (4), and consider an approximation scheme of the 
form 



n d 

/( x ) = EE c « G (^-'«) 

a = l jU = l 



(18) 



in which the number of centers is smaller than the num- 
ber of examples, reducing the number of coefficients that 
must be estimated. We notice that eq. (18) can be writ- 
ten as 



where each additive component has the form: 



/,(/) = v c jG(/-i:) 



a = l 



Therefore another advantage of this technique is that 
the additive components are now independent, each of 
them being a one-dimensional Radial Basis Functions. 

We can now use the same argument from section (4) to 
introduce a linear transformation of the inputs x — ► Wx, 
where W is a d' x d matrix. Calling w^ the //-th column 
of W , and performing the substitution x — ► Wx in eq. 
(18), we obtain 



n d' 

/( x ) = EE c « g ( w m- x -^) 

a = l jU = l 



(19) 



We now define the following one-dimensional function: 



a = l 

and rewrite the approximation scheme of eq: (19) as 



d' 
/( x ) = J2 h f ( - w f ' x ) 

8 = 1 



(20) 



Notice the similarity between eq. (20) and the Projec- 
tion Pursuit Regression technique: in both schemes the 
unknown function is approximated by a linear superposi- 
tion of one-dimensional variables, which are projections 
of the original variables on certain vectors that have been 
estimated. In Projection Pursuit Regression the choice 
of the functions hk(y) is left to the user. In our case the 
hk are one-dimensional Radial Basis Functions, for ex- 
ample cubic splines, or Gaussians. The choice depends, 
strictly speaking, on the specific prior, that is, on the 
specific smoothness assumptions made by the user. In- 
terestingly, in many applications of Projection Pursuit 
Regression the functions hk have been indeed chosen to 
be cubic splines. 

Let us briefly review the steps that bring us from the 
classical additive approximation scheme of eq. (9) to 
a Projection Pursuit Regression-like type of approxima- 
tion: 

1. the regularization parameters 6^ of the classical ap- 
proximation scheme (9) are considered as free pa- 
rameters; 

2. the number of centers is chosen to be smaller than 
the number of data points; 

3. it is assumed that the true relevant variables are 
some unknown linear combination of the original 
variables; 

We notice that in the special case in which each addi- 
tive component has just one center (n = 1), the approx- 
imation scheme of eq. (19) becomes: 
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(21) 



If the basis function G were a sigmoidal function this 
would be clearly a standard Multilayer Perceptron with 
one layer of hidden units. Sigmoidal functions cannot 
be derived from regularization theory, but we will see in 
section (6) the relationship between a sigmoidal function 
and a basis function that can be derived from regular- 
ization, like the absolute value function. 

There are clearly a number of computational issues re- 
lated to how to find the parameters of an approximation 
scheme like the one of eq. (19), but we do not discuss 
them here. We present instead, in section (7), some ex- 
perimental results, and will describe the algorithm used 
to obtain them. 

5 Priors, stabilizers and basis functions 

It is well known that a variational principle such as equa- 
tion (1) can be derived not only in the context of func- 
tional analysis (Tikhonov and Arsenin, 1977), but also in 
a probabilistic framework (Marroquin et al., 1987; Bert- 
ero et al., 1988, Wahba, 1990). In this section we illus- 
trate this connection informally, without addressing the 
several deep mathematical issues of the problem. 

Suppose that the set g = {(x;,?/;) 6 R" x R}fLi of 
data has been obtained by random sampling a function 
/, defined on R n , in the presence of noise, that is 



/(x,-) = !/,- + e,-, i=l,...,N (22) 

where e 8 - are random independent variables with a given 
distribution. We are interested in recovering the func- 
tion /, or an estimate of it, from the set of data g. We 
take a probabilistic approach, and regard the function / 
as the realization of a random field with a known prior 
probability distribution. Let us define: 

- Pf/lfif] as the conditional probability of the function 
/ given the examples g. 

- V [g\f] as the conditional probability of g given /. If 
the function underlying the data is /, this is the prob- 
ability that by random sampling the function / at the 
sites {xi}^ the set of measurement {yi}f = i is obtained, 
being therefore a model of the noise. 

- V[f]: is the a prion probability of the random field /. 
This embodies our a prion knowledge of the function, 
and can be used to impose constraints on the model, 
assigning significant probability only to those functions 
that satisfy those constraints. 

Assuming that the probability distributions Pfffl/] 
and V[f] are know, the posterior distribution Pf/lfif] can 
now be computed by applying the Bayes rule: 



N 

#[/] = X>.--/(x,-)) 2 + ^[/]. 

8 = 1 

where A = 2a 2 a. This functional is the same as that 
of eq. (1), and from here it is clear that the parameter 
A, that is usually called the "regularization parameter" 
determines the trade-off between the level of the noise 
and the strength of the a priori assumptions about the 
solution, therefore controlling the compromise between 
the degree of smoothness of the solution and its closeness 
to the data. 

As we have pointed out (Poggio and Girosi, 1989), 
prior probabilities can also be seen as a measure of com- 
plexity, assigning high complexity to the functions with 
small probability. It has been proposed by Rissanen 
(1978) to measure the complexity of a hypothesis in 
terms of the bit length needed to encode it. It turns 
out that the MAP estimate mentioned above is closely 
related to the Minimum Description Length Principle: 
the hypothesis / which for given g can be described in 
the most compact way is chosen as the "best" hypothe- 
sis. Similar ideas have been explored by others (for in- 
stance Solomonoff in 1978). They connect data compres- 
sion and coding with Bayesian inference, regularization, 
function approximation and learning. 



V[f\g]<xV[g\f]V[f]. (23) 

We now make the assumption that the noise variables 
in eq. (22) are normally distributed, with variance a. 
Therefore the probability Pfffl/] can be written as: 

Half] oc e-^^i (yi - /(Xi))a 

where a is the variance of the noise. 

The model for the prior probability distribution V[f] 
is chosen in analogy with the discrete case (when the 
function / is defined on a finite subset of a n-dimensional 
lattice) for which the problem can be rigorously formal- 
ized (Marroquin et al., 1987). The prior probability V[f] 
is written as 

V[f] oc e -°*] 

where <f>[f] is a smoothness functional of the type de- 
scribed in section (3) and a a positive real number. This 
form of probability distribution gives high probability 
only to those functions for which the term <f>[f] is small, 
and embodies the a priori knowledge that one has about 
the system. 

Following the Bayes rule (23) the a posteriori proba- 
bility of / is written as 



V[f\g]<xe 



-^[E^i(»'--f( x ')) a +2««' a *[/]] 



(24) 



One simple estimate of the function / from the prob- 
ability distribution (24) is the so called MAP (Maximum 
A Posteriori) estimate, that considers the function that 
maximizes the a posteriori probability Pf/lfif], or mini- 
mizes the exponent in equation (24). The MAP estimate 
of/ is therefore the minimizer of the following functional: 



5.1 The Bayesian interpretation of Generalized 
Regularization Networks 

In the probabilistic interpretation of standard regulariza- 
tion the term A<^[/] in the regularization functional cor- 
responds to the following prior probability in a Bayesian 
formulation in which the MAP estimate is sought: 



V[f] 



-A</>[/] 



From this point of view, the extension of section (4) cor- 
responds (again informally) to choose an a priori prob- 
ability of the form 

7>[/]oc / ^e- A *^(/(x)- ff (Wx)) 



where 6g means that a functional integration is being 
performed. This restricts the space of functions on which 
the probability distribution is defined to the class of func- 
tions that can be written as /(x) = j(Wx), and assume 
a prior probability distribution e~ x ^ a ^ ^' for the func- 
tions g, where <f> is now a radially symmetric stabilizer. 
In a similar manner, in the case of additive approxi- 
mation the prior probability of / is concentrated on those 
functions / that can be written as sums of additive com- 
ponents, and corresponding priors are of the form: 



V[f]<x / 8h...8f d U 



-4>[f„] 
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This is equivalent to saying that we know a priori that 
the underlying function is additive. 



6 Additive splines, hinge functions, 
sigmoidal neural nets 

In the previous sections we have shown how to extend 
RN to schemes that we have called GRN, which include 
ridge approximation schemes of the PPR type, that is 

d 1 

/w =y]M w M - x ) > 



=1 



where 



MlO = I>£G(!/ - i£). 



The form of the basis function G depends on the sta- 
bilizer, and a list of "admissible" G has been given in 
section (3). These include the absolute value G(x) = \x\ 
- corresponding to piecewise linear splines, and the func- 
tion G(x) = \x\ 3 - corresponding to cubic splines (used 
in typical implementations of PPR), as well as Gaussian 
functions. Though it may seem natural to think that sig- 
moidal multilayer perceptrons may be included in this 
framework, it is actually impossible to derive directly 
from regularization principles the sigmoidal activation 
functions typically used in Multilayer Perceptrons. In 
the following section we show, however, that there is a 
close relationship between basis functions of the hinge, 
the sigmoid and the Gaussian type. 

6.1 From additive splines to ramp and hinge 
functions 

We will consider here the one-dimensional case. Mul- 
tidimensional additive approximations consist of one- 
dimensional terms (once the W has been fixed). We 
consider the approximation with the lowest possible de- 
gree of smoothness: piecewise linear. The associated 
basis function G(x) = \x\ is shown in figure 5 top left, 
and the associated stabilizer is given by 



m 



l/MI 5 



Its use in approximating a one-dimensional function con- 
sists of the linear combination with appropriate coeffi- 
cients of translates of \x\. It is easy to see that a lin- 
ear combination of two translates of \x\ with appropri- 
ate coefficients (positive and negative and equal in ab- 
solute value) yields the piecewise linear threshold func- 
tion <Jl{x) shown in figure 5. Linear combinations of 
translates of such functions can be used to approximate 
one-dimensional functions. A similar derivative-like, lin- 
ear combination of two translates of <Jl{x) functions with 
appropriate coefficients yields the Gaussian-like function 
gh{x) also shown in figure 5. Linear combinations of 
translates of this function can also be used for approxi- 
mation of a function. Thus any given approximation in 
terms of gL{x) can be rewritten in terms of <Jl{x) and 
the latter can be in turn expressed in terms of the basis 
function \x\. 

Notice that the basis functions \x\ underlie the 
"hinge" technique proposed by Breiman (1992), whereas 



the basis functions <Jl{x) are sigmoidal- like and the 
gh{x) are Gaussian-like. The arguments above show the 
close relations between all of them, despite the fact that 
only | a; | is strictly a "legal" basis function from the point 
of view of regularization (<7l(:e) is not, though the very 
similar but smoother Gaussian is). Notice also that \x\ 
can be expressed in terms of "ramp" functions, that is 

\x\ = X + + X-. 

These relationships imply that it may be interesting 
to compare how well each of these basis functions is able 
to approximate some simple function. To do this we used 
the model f(x) = ^2 a c a G(x — t a ) to approximate the 
function h(x) = sin(27r;c) on [0, 1], where G(x) is one of 
the basis functions of figure 5. The function sin(27r;c) 
is plotted in figure 6. Fifty training points and 10,000 
test points were chosen uniformly on [0, 1]. The param- 
eters were learned using the iterative backfitting algo- 
rithm that will be described in section 7. We looked at 
the function learned after fitting 1, 2, 4, 8 and 16 basis 
functions. The resulting approximations are plotted in 
the following figures and the errors are summarized in 
table 1. 

The results show that the performance of all three 
basis functions is fairly close as the number of basis 
functions increases. All models did a good job of ap- 
proximating sin(27r;c). The absolute value function did 
slightly better and the "Gaussian" function did slightly 
worse. It is interesting that the approximation using two 
absolute value functions is almost identical to the ap- 
proximation using one "sigmoidal" function which again 
shows that two absolute value basis functions can sum 
to equal one "sigmoidal" piecewise linear function. 

7 Numerical illustrations 

7.1 Comparing additive and non-additive 
models 

In order to illustrate some of the ideas presented in this 
paper and to provide some practical intuition about the 
various models, we present numerical experiments com- 
paring the performance of additive and non-additive net- 
works on two-dimensional problems. In a model consist- 
ing of a sum of two-dimensional Gaussians, the model 
can be changed from a non-additive Radial Basis Func- 
tion network to an additive network by "elongating" the 
Gaussians along the two coordinate axes. This allows us 
to measure the performance of a network as it changes 
from a non-additive scheme to an additive one. 

Five different models were tested. The first three dif- 
fer only in the variances of the Gaussian along the two 
coordinate axes. The ratio of the x variance to the y vari- 
ance determines the elongation of the Gaussian. These 
models all have the same form and can be written as: 

N 

/(x) = ^[dtx - x,-) + G 2 (x - x,-)] 

8 = 1 

where 



d = e 



-(#+#) 



and 



G 2 



_(!± + a_) 



The models differ only in the values of a\ and C2. For 
the first model, a\ = .5 and it 2 = .5 (RBF), for the 
second model a\ = 10 and <72 = .5 (elliptical Gaussian), 
and for the third model, a\ = oo and <72 = .5 (additive). 
These models correspond to placing two Gaussians at 
each data point x;, with one Gaussian elongated in the 
x direction and one elongated in the y direction. In the 
first case (RBF) there is no elongation, in the second case 
(elliptical Gaussian) there is moderate elongation, and 
in the last case (additive) there is infinite elongation. In 
these three models, the centers were fixed in the learning 
algorithm and equal to the training examples. The only 
parameters that were learned were the coefficients c;. 

The fourth model is an additive model of the form 
(18), in which the number of centers is smaller than the 
number of data points, but the additive components are 
independent, and can be written as: 



fix, y ) = J2 h -°( x - o + E c ? G (y ~ *y ) 

a = l /3 = 1 

where the basis function is the Gaussian: 



G{x) 



-2x' 



In this model, the centers were also fixed in the learn- 
ing algorithm, and were a proper subset of the training 
examples, so that there were fewer centers than exam- 
ples. In the experiments that follow, 7 centers were used 
with this model, and the coefficients b a and c a were de- 
termined by least squares. 

The fifth model is a Generalized Regularization Net- 
work model, of the form (21), that uses a Gaussian basis 
function: 



/(x) 



E^^ (W °' X "'" ) 



In this model the weight vectors, centers, and coeffi- 
cients are all learned. 

The coefficients of the first four models were set by 
solving the linear system of equations by using the 
pseudo-inverse, which finds the best mean squared fit 
of the linear model to the data. 

The fifth model was trained by fitting one basis func- 
tion at a time according to the following algorithm: 

• Add a new basis function; 

• Optimize the parameters w„, t a and c a using the 
random step algorithm (described below); 

• Backfitting: for each basis function a added so far: 

— hold the parameters of all other functions 
fixed; 

— reoptimize the parameters of function a; 

• Repeat the backfitting stage until there is no sig- 
nificant decrease in L^ error. 



The random step algorithm (Caprile and Girosi, 1990) 
for optimizing a set of parameters works as follows. Pick 
random changes to each parameter such that each ran- 
dom change lies within some interval [a,b]. Add the 
random changes to each parameter and then calculate 
the new error between the output of the network and 
the target values. If the error decreases, then keep the 
changes and double the length of the interval for pick- 
ing random changes. If the error increases, then throw 
out the changes and halve the size of the interval. If the 
length of the interval becomes less than some threshold, 
then reset the length of the interval to some larger value. 

The five models were each tested on two different func- 
tions: a two-dimensional additive function: 

h(x, y) = sin(27r;c) + 4(y - 0.5) 2 
and the two-dimensional Gabor function: 

g(x, y) = e~" x " cos(.757r(;c + y)). 

The graphs of these functions are shown in figure 
10. The training data for the additive function con- 
sisted of 20 points picked from a uniform distribution 
on [0, 1] x [0, 1]. Another 10,000 points were randomly 
chosen to serve as test data. The training data for the 
Gabor function consisted of 20 points picked from a uni- 
form distribution on [—1,1] x [—1,1] with an additional 
10,000 points used as test data. 

In order to see how sensitive were the performances 
to the choice of basis function, we also repeated the ex- 
periments for the models 3, 4 and 5 with a sigmoid (that 
is not a basis function that can be derived from regular- 
ization theory) replacing the Gaussian basis function. In 
our experiments we used the standard sigmoid function: 



<j(x) 
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These models (6, 7 and 8) are shown in table 2 together 
with models 1 to 5. Notice that only model 8 is a Multi- 
layer Perceptron in the standard sense. The results are 
summarized in table 3. 

Plots of some of the approximations are shown in fig- 
ures 11, 12, 13 and 14. As expected, the results show 
that the additive model was able to approximate the ad- 
ditive function, h(x, y) better than both the RBF model 
and the elliptical Gaussian model. Also, there seems to 
be a smooth degradation of performance as the model 
changes from the additive to the Radial Basis Function 
(figure 11). Just the opposite results are seen in approx- 
imating the non-additive Gabor function, g(x,y). The 
RBF model did very well, while the additive model did 
a very poor job in approximating the Gabor function 
(figures 12 and 13a). However, we see that the GRN 
scheme (model 5), gives a fairly good approximation (fig- 
ure 13b). This is due to the fact that the learning al- 
gorithm was able to find better directions to project the 
data than the x and y axes as in the pure additive model. 
We can also see from table 3 that the additive model 
with fewer centers than examples (model 4) has a larger 
training error than the purely additive model 3, but a 
much smaller test error. The results for the sigmoidal 
additive model learning the additive function h (figure 



14) show that it is comparable to the Gaussian additive 
model. The first three models we considered had a num- 
ber of parameters equal to the number of data points, 
and were supposed to exactly interpolate the data, so 
that one may wonder why the training errors are not 
exactly zero. This is due to the ill-conditioning of the 
associated linear system, which is a common problem in 
Radial Basis Functions (Dyn, Levin and Rippa, 1986). 

8 Summary and remarks 

A large number of approximation techniques can be writ- 
ten as multilayer networks with one hidden layer, as 
shown in figure (16). In past papers (Poggio and Girosi, 
1989; Poggio and Girosi, 1990, 1990b; Maruyama, Girosi 
and Poggio, 1992) we showed how to derive RBF, HBF 
and several types of multidimensional splines from reg- 
ularization principles of the form used to deal with the 
ill-posed problem of function approximation. We had 
not used regularization to yield approximation schemes 
of the additive type (Wahba, 1990; Hastie and Tibshi- 
rani, 1990), such as additive splines, ridge approxima- 
tion of the PPR type and hinge functions. In this paper, 
we show that appropriate stabilizers can be defined to 
justify such additive schemes, and that the same exten- 
sions that leads from RBF to HBF leads from additive 
splines to ridge function approximation schemes of the 
Projection Pursuit Regression type. Our Generalized 
Regularization Networks include, depending on the sta- 
bilizer (that is on the prior knowledge on the functions 
we want to approximate), HBF networks, ridge approxi- 
mation and tensor products splines. Figure (15) shows a 
diagram of the relationships. Notice that HBF networks 
and Ridge Regression networks are directly related in 
the special case of normalized inputs (Maruyama, Girosi 
and Poggio, 1992). Also note that Gaussian HBF net- 
works, as described by Poggio and Girosi (1990) contain 
in the limit the additive models we describe here. 

We feel that there is now a theoretical framework that 
justifies a large spectrum of approximation schemes in 
terms of different smoothness constraints imposed within 
the same regularization functional to solve the ill-posed 
problem of function approximation from sparse data. 
The claim is that all the different networks and cor- 
responding approximation schemes can be justified in 
terms of the variational principle 



architecture will work best for the class of function de- 
fined by the associated prior (that is stabilizer), an ex- 
pectation which is consistent with numerical results (see 
our numerical experiments in this paper, and Maruyama 
et al. 1992; see also Donohue and Johnstone, 1989). 

Of the several points suggested by our results we will 
discuss one here: it regards the surprising relative suc- 
cess of additive schemes of the ridge approximation type 
in real world applications. 

As we have seen, ridge approximation schemes depend 
on priors that combine additivity of one-dimensional 
functions with the usual assumption of smoothness. Do 
such priors capture some fundamental property of the 
physical world? Consider for example the problem of 
object recognition, or the problem of motor control. We 
can recognize almost any object from any of many small 
subsets of its features, visual and non- visual. We can 
perform many motor actions in several different ways. In 
most situations, our sensory and motor worlds are redun- 
dant. In terms of GRN this means that instead of high- 
dimensional centers, any of several lower- dimensional 
centers are often sufficient to perform a given task. This 
means that the "and" of a high-dimensional conjunction 
can be replaced by the "or" of its components - a face 
may be recognized by its eyebrows alone, or a mug by 
its color. To recognize an object, we may use not only 
templates comprising all its features, but also subtem- 
plates, comprising subsets of features. Additive, small 
centers - in the limit with dimensionality one - with the 
appropriate W are of course associated with stabilizers 
of the additive type. 

Splitting the recognizable world into its additive parts 
may well be preferable to reconstructing it in its full mul- 
tidimensionality, because a system composed of several 
independently accessible parts is inherently more robust 
than a whole simultaneously dependent on each of its 
parts. The small loss in uniqueness of recognition is eas- 
ily offset by the gain against noise and occlusion. There 
is also a possible meta-argument that we report here only 
for the sake of curiosity. It may be argued that humans 
possibly would not be able to understand the world if 
it were not additive because of the too-large number of 
necessary examples (because of high dimensionality of 
any sensory input such as an image). Thus one may be 
tempted to conjecture that our sensory world is biased 
towards an "additive structure". 



N 



H[f] = J2(f(xi)-yif + ^[f] 



(25) 



They differ because of different choices of stabilizers 
<f>, which correspond to different assumptions of smooth- 
ness. In this context, we believe that the Bayesian inter- 
pretation is one of the main advantages of regularization: 
it makes clear that different network architectures cor- 
respond to different prior assumptions of smoothness of 
the functions to be approximated. 

The common framework we have derived suggests 
that differences between the various network architec- 
tures are relatively minor, corresponding to different 
smoothness assumptions. One would expect that each 



A Derivation of the general form of 
solution of the regularization 
problem 

We have seen in section (2) that the regularized solu- 
tion of the approximation problem is the function that 
minimizes a cost functional of the following form: 
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#[/] = I>.--/(x.-)) 2 + W] 



where the smoothness functional <f>[f] is given by 



(26) 
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For the smoothness functional 



Ir* G(s) 

The first term measures the distance between the data 
and the desired solution /, and the second term measures 
the cost associated with the deviation from smoothness. 
For a wide class of functionals <f> the solutions of the 
minimization problem (26) all have the same form. A 
detailed and rigorous derivation of the solution of the 
variational principle associated with eq. (26) is outside 
the scope of this paper. We present here a simple deriva- 
tion and refer the reader to the current literature for the 
mathematical details (Wahba, 1990; Madych and Nel- 
son, 1990; Dyn, 1987). 

We first notice that, depending on the choice of G, 
the functional <f>[f] can have a non-empty null space, 
and therefore there is a certain class of functions that 
are "invisible" to it. To cope with this problem we first 
define an equivalence relation among all the functions 
that differ for an element of the null space of <f>[f]. Then 
we express the first term of H[f] in terms of the Fourier 
transform of /: 

/(x) = C [ ds /(s)e ix - s 

JR d 

obtaining the functional 
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H[f] = Y J (y l -C f rfs/(s)e 8X ' s ) 2 +A / ds 

Then we notice that since / is real, its Fourier transform 
satisfies the constraint: 



r(s) = /(-s) 

so that the functional can be rewritten as: 



6 f is /( " s)/(s) 
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*( B -t) = 2^. 
Jr* G(s) G(t) 

Using these results we can now write eq. (27) as: 

f; (w -/(x.-)) e *- t + A^ = o. 

Changing t in — t and multiplying by G(t) on both sides 
of this equation we get: 
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f(t) = G(-t)J2 
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We now define the coefficients 
(Vi ~ /(x;)) 



A 



i = l,...,N , 



assume that G is symmetric (so that its Fourier trans- 
form is real), and take the Fourier transform of the last 
equation, obtaining: 
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/(x) = ^2 CiS(xi - x) * G(x) = ^2 CiG(x - x,) . 

8 = 1 8 = 1 

We now remember that we had defined as equivalent all 
the functions differing by a term that lies in the null 
space of <f>[f], and therefore the most general solution of 
the minimization problem is 



H[f] = Y,(m-C S d S f( S )e^y+x[ ds f( ^ f(s) 

~l JR d JR d G-(s) 

In order to find the minimum of this functional we take 
its functional derivatives with respect to /: 
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We now proceed to compute the functional derivatives 
of the first and second term of H[f]. For the first term 
we have: 
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/(x) = ^c,-G(x-x,-)+pW 

8 = 1 

where p(x) is a term that lies in the null space of <f>[f]. 

B Approximation of vector fields 

through multioutput regularization 
networks 

Consider the problem of approximating a vector field 
y(x) from a set of sparse data, the examples, which are 
pairs (y 8 ',Xj) for i = 1 • • -N. Choose a Generalized Reg- 
ularization Network as the approximation scheme, that 
is, a network with one "hidden" layer and linear output 
units. Consider the case of N examples, n < N centers, 
input dimensionality d and output dimensionality q (see 
figure 17). Then the approximation is 



y(x) 
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with G being the chosen Green function. The equation 
can be rewritten in matrix notation as 

y(x) = Cg(x) 



where g is the vector with elements g{ = G(x — x 8 ). 
Let us define as G the matrix of the chosen Green func- 
tion evaluated at the examples, that is, the matrix with 
elements Gij = G(x 8 - — Xj). Then the "weights" c are 
"learned" from the examples by solving 

Y = CG 

where Y is defined as the matrix in which column / is 
the example y/. C is defined as the matrix in which row 
m is the vector c m . This means that x is a d x 1 matrix, 
C is a q x n matrix, Y is a q x N matrix and G is a 
n x N matrix. Then the set of weights C is given by 

C = YG+ 

It also follows (though it is not so well known) that 
the vector field y is approximated by the network as the 
linear combination of the example fields y; , that is 

y(x) = YG+g(x) 
which can be rewritten as 



N 



y( x ) = J2 b 'W' y ' 



1=1 

where the 6; depend on the chosen G, according to 

b(x) = G+g(x) 

Thus for any choice of the regularization network - 
even HBF - and any choice of the Green function - in- 
cluding Green functions corresponding to additive splines 
and tensor product splines - the estimated output vec- 
tor is always a linear combination of example vectors 
with coefficients b that depend (nonlmearly) on the in- 
put value. The result is valid for all networks with one 
hidden layer and linear outputs, provided that a L 2 cri- 
terion is used for training. Thus, for all types of regu- 
larization networks and all Green functions the output 
is always a linear combination of output examples (see 
Poggio and Girosi 1989). 

References 

[1] A. R. Barron and Barron R. L. Statistical learning 
networks: a unifying view. In Symposium on the In- 
terface: Statistics and Computing Science, Reston, 
Virginia, April 1988. 

[2] M. Bertero, T. Poggio, and V. Torre. Ill-posed 
problems in early vision. Proceedings of the IEEE, 
76:869-889, 1988. 

[3] L. Breiman. Hinging hyperplanes for regression, 
classification, and function approximation. 1992. 
(submitted for publication). 

[4] D.S. Broomhead and D. Lowe. Multivariate func- 
tional interpolation and adaptive networks. Com- 
plex Systems, 2:321-355, 1988. 

[5] A. Buja, T. Hastie, and R. Tibshirani. Linear 
smoothers and additive models. The Annals of 
Statistics, 17:453-555, 1989. 



12 



[6] B. Caprile and F. Girosi. A nondeterministic min- 
imization algorithm. A.I. Memo 1254, Artificial 
Intelligence Laboratory, Massachusetts Institute of 
Technology, Cambridge, MA, September 1990. 

[7] D.L. Donoho and I.M. Johnstone. Projection-based 
approximation and a duality with kernel methods. 
The Annals of Statistics, 17(1):58-106, 1989. 

[8] J. Duchon. Spline minimizing rotation-invariant 
semi-norms in Sobolev spaces. In W. Schempp and 
K. Zeller, editors, Constructive theory of functions 
os several variables, lecture Notes in Mathematics, 
571. Springer- Verlag, Berlin, 1977. 

[9] N. Dyn. Interpolation of scattered data by radial 
functions. In C.K. Chui, L.L. Schumaker, and F.I. 
Utreras, editors, Topics in multivariate approxima- 
tion. Academic Press, New York, 1987. 

[10] N. Dyn. Interpolation and approximation by ra- 
dial and related functions. In C.K. Chui, L.L. Schu- 
maker, and D.J. Ward, editors, Approximation The- 
ory, VI, pages 211-234. Academic Press, New York, 
1991. 

[11] N. Dyn, D. Levin, and S. Rippa. Numerical proce- 
dures for surface fitting of scattered data by radial 
functions. SIAM J. Sci. Stat. Comput., 7(2):639- 
659, April 1986. 

[12] J.H. Friedman and W. Stuetzle. Projection pursuit 
regression. Journal of the American Statistical As- 
sociation, 76(376):817-823, 1981. 

[13] R.L. Harder and R.M. Desmarais. Interpolation us- 
ing surface splines. J. Aircraft, 9:189-191, 1972. 

[14] T. Hastie and R. Tibshirani. Generalized additive 
models. Statistical Science, 1:297-318, 1986. 

[15] T. Hastie and R. Tibshirani. Generalized additive 
models: some applications. J. Amer. Statistical As- 
soc, 82:371-386, 1987. 

[16] T. Hastie and R. Tibshirani. Generalized Additive 
Models, volume 43 of Monographs on Statistics and 
Applied Probability. Chapman and Hall, London, 
1990. 

[17] P.J. Huber. Projection pursuit. The Annals of 
Statistics, 13(2):435-475, 1985. 

[18] W.R. Madych and S.A. Nelson. Multivari- 

ate interpolation and conditionally positive defi- 
nite functions. II. Mathematics of Computation, 
54(189):211-230, January 1990. 

[19] J. L. Marroquin, S. Mitter, and T. Poggio. Prob- 
abilistic solution of ill-posed problems in computa- 
tional vision. J. Amer. Stat. Assoc, 82:76-89, 1987. 

[20] M. Maruyama, F. Girosi, and T. Poggio. A connec- 
tion between HBF and MLP. A.I. Memo No. 1291, 
Artificial Intelligence Laboratory, Massachusetts In- 
stitute of Technology, 1992. 

[21] C. A. Micchelli. Interpolation of scattered data: 
distance matrices and conditionally positive defi- 
nite functions. Constructive Approximation, 2:11- 
22, 1986. 



[22] J. Moody and C. Darken. Fast learning in networks 
of locally-tuned processing units. Neural Computa- 
tion, l(2):281-294, 1989. 

[23] T. Poggio and F. Girosi. A theory of networks for 
approximation and learning. A.I. Memo No. 1140, 
Artificial Intelligence Laboratory, Massachusetts In- 
stitute of Technology, 1989. 

[24] T. Poggio and F. Girosi. Networks for approxima- 
tion and learning. Proceedings of the IEEE, 78(9), 
September 1990. 

[25] T. Poggio and F. Girosi. Extension of a theory 
of networks for approximation and learning: di- 
mensionality reduction and clustering. In Proceed- 
ings Image Understanding Workshop, pages 597- 
603, Pittsburgh, Pennsylvania, September 11-13 
1990a. Morgan Kaufmann. 

[26] T. Poggio and F. Girosi. Regularization algorithms 
for learning that are equivalent to multilayer net- 
works. Science, 247:978-982, 1990b. 

[27] M. J. D. Powell. Radial basis functions for multi- 
variable interpolation: a review. In J. C. Mason and 
M. G. Cox, editors, Algorithms for Approximation. 
Clarendon Press, Oxford, 1987. 

[28] M.J.D. Powell. The theory of radial basis functions 
approximation in 1990. Technical Report NA11, De- 
partment of Applied Mathematics and Theoretical 
Physics, Cambridge, England, December 1990. 

[29] J. Rissanen. Modeling by shortest data description. 
Automatica, 14:465-471, 1978. 

[30] N. Sivakumar and J.D. Ward. On the best least 
square fit by radial functions to multidimensional 
scattered data. Technical Report 251, Center for 
Approximation Theory, Texas A & M University, 
June 1991. 

[31] R.J. Solomonoff. Complexity-based induction sys- 
tems: comparison and convergence theorems. IEEE 
Transactions on Information Theory, 24, 1978. 

[32] A. N. Tikhonov. Solution of incorrectly formulated 
problems and the regularization method. Soviet 
Math. DokL, 4:1035-1038, 1963. 

[33] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill- 
posed Problems. W. H. Winston, Washington, D.C., 
1977. 

[34] G. Wahba. Splines Models for Observational Data. 
Series in Applied Mathematics, Vol. 59, SIAM, 
Philadelphia, 1990. 



13 





1 basis 
function 


2 basis 
functions 


4 basis 
functions 


8 basis 
functions 


16 basis 
functions 


Absolute value 


train: 0.798076 
test: 0.762225 


0.160382 
0.127020 


0.011687 
0.012427 


0.000555 
0.001179 


0.000056 
0.000144 


"Sigmoidal" 


train: 0.161108 
test: 0.128057 


0.131835 
0.106780 


0.001599 
0.001972 


0.000427 
0.000787 


0.000037 
0.000163 


"Gaussian" 


train: 0.497329 
test: 0.546142 


0.072549 
0.087254 


0.002880 
0.003820 


0.000524 
0.001211 


0.000024 
0.000306 



Table 1: L^ training and test error for each of the 3 piecewise linear models using different numbers of basis functions. 



Model 1 


( (x-x,f | (y-y,f\ ( (x-x,f | (y-y,f\ 


(7i = (72 = 0.5 


Model 2 


( (*-*,) 2 , (»-»,) 2 A fix-*,) 2 , (»-»,) 2 A 


CT 1 = 10, (7 2 = 0.5 


/(*,!/) = £i=i*[e ^ '* " )+e \ " - 1 )] 


Model 3 


r , . ^9n r (x-x t ) 2 (m-m,) 2 


(7=0.5 


f(x,y) = l^ i=1 Ci[e * +e - J 


Model 4 


,, , „7 , (-*S) 2 „7 (»-<2> 2 


(7 = 0.5 


f(x,y) = ^ a=1 b a e ° + 2^/3=1 c P e 


Model 5 


/(^2/) = E:=iC«e-( w - x -*^ 


- 


Model 6 


f( x , y) = ESi c i[ a ( x - x i) + a (y - Vi)] 


- 


Model 7 


f(x, V) = ELl h <*<r{x - t") + ELl c P a (v ~ tf) 


- 


Model 8 


f{ x , y) = Ea = l C«<T(w a -X-t a ) 


- 



Table 2: The eight models we tested in our numerical experiments. 





Model 1 


Model 2 


Model 3 


Model 4 


Model 5 


Model 6 


Model 7 


Model 8 


h(x,y) 


train: 0.000036 
test: 0.011717 


0.000067 
0.001598 


0.000001 
0.000007 


0.000001 
0.000009 


0.000170 
0.001422 


0.000001 
0.000015 


0.000003 
0.000020 


0.000743 
0.026699 


g{x,y) 


train: 0.000000 
test: 0.003818 


0.000000 
0.344881 


0.000000 
67.95237 


0.345423 
1.222111 


0.000001 
0.033964 


0.000000 
98.419816 


0.456822 
1.397397 


0.000044 
0.191055 



Table 3: A summary of the results of our numerical experiments. Each table entry contains the L^ errors for both 

the training set and the test set. 
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G(r) = r A 2 ln(r) 



U 




Figure 1: The "thin plate" radial basis function G(r) = r 2 ln(r), where r = ||x| 
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z = exp(- x A 2 - y A 2) 




Figure 2: The Gaussian basis function G(r) 



where 
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z = expO Ixl - lyl) 




Figure 3: The basis function G(x) 
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z = exp(-x A 2) 



z = exp(-y A 2) 





a) 



b) 



z = exp(-x A 2) + exp(-y A 2) 




c) 

Figure 4: In (c) it is shown an additive basis function, in the case in which the additive component of the basis 
functions (a and b) are gaussian. 
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Figure 5: a) Absolute value basis function, \x\, b) "Sigmoidal" basis function <Jl{x) c) Gaussian-like basis function 
9l{x) 




Figure 6: Sin(2irx) 
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Figure 7: a) Approximation using one absolute value basis function b) 2 basis functions c) 4 basis functions d) 8 
basis functions e) 16 basis functions 
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Figure 8: a) Approximation using one "sigmoidal" basis function b) 2 basis functions c) 4 basis functions d) 8 basis 
functions e) 16 basis functions 
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Figure 9: a) Approximation using one "Gaussian" basis function b) 2 basis functions c) 4 basis functions d) 8 basis 
functions e) 16 basis functions 
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Figure 10: a) Graph of h(x, y). b) Graph of g(x, y). 




Figure 11: a) RBF Gaussian model approximation of h(x, y) (model 1). b) Elliptical Gaussian model approximation 
of h(x, y) (model 2). c) Additive Gaussian model approximation of h(x, y) (model 3). 





Figure 12: a) RBF Gaussian model approximation of g(x, y) (model 1). b) Elliptical Gaussian model approximation 
of g(x,y) (model 2). 
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Figure 13: a) Additive Gaussian model approximation of g(x, y) (model 3). b) GRN Approximation of g(x, y) (model 
5)- 





Figure 14: a) Sigmoidal additive model approximation of h(x,y) (model 6). b) Sigmoidal additive model approxi- 
mation of h(x,y) using fewer centers than examples (model 7). c) Multilayer Perceptron approximation of h(x,y) 
(model 8). 
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Figure 15: Several classes of approximation schemes and associated network architectures can be derived from 
regularization with the appropriate choice of smoothness priors and corresponding stabilizers and Greens functions 
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Figure 16: The most general network with one hidden layer and scalar output. 
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Figure 17: The most general network with one hidden layer and vector output. Notice that this approximation 
of a q-dimensional vector field has in general fewer parameters than the alternative representation consisting of q 
networks with one-dimensional outputs. If the only free parameters are the weights from the hidden layer to the 
output (as for simple RBF with if n = N) the two representations are equivalent. 
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